1. Introduction
1.1. Background 

The need for easily implementable methods for regression problems with a large
number of variables has given rise to an extensive, and growing, literature over
the last decade. Penalized least squares with $\ell_1$-type penalties is among
the most popular techniques in this area. This method is closely related to
restricted least squares minimization under an $\ell_1$-restriction on the
regression coefficients, which is called the Lasso method, following [24]. We
refer to both methods as Lasso-type methods. These methods have become most
popular within the linear regression framework. Let $(Z_1, Y_1), \ldots,
(Z_n, Y_n)$ be a sample of independent random pairs, with $Z_i = (Z_{1i},
\ldots, Z_{Mi})$ and $Y_i = \lambda_1 Z_{1i} + \cdots + \lambda_M Z_{Mi} + W_i$,
$i = 1, \ldots, n$, where the $W_i$ are independent error terms. Then, for a
given $T > 0$, the Lasso estimate of $\lambda \in \mathbb{R}^M$ is

$$\hat\lambda_{\mathrm{lasso}} = \arg\min_{|\lambda|_1 \le T} \Big\{ \frac{1}{n} \sum_{i=1}^n (Y_i - \lambda_1 Z_{1i} - \cdots - \lambda_M Z_{Mi})^2 \Big\} \tag{1.1}$$

where $|\lambda|_1 = \sum_{j=1}^M |\lambda_j|$. For a given tuning parameter
$\gamma > 0$, the penalized estimate of $\lambda \in \mathbb{R}^M$ is

$$\hat\lambda_{\mathrm{pen}} = \arg\min_{\lambda \in \mathbb{R}^M} \Big\{ \frac{1}{n} \sum_{i=1}^n (Y_i - \lambda_1 Z_{1i} - \cdots - \lambda_M Z_{Mi})^2 + \gamma |\lambda|_1 \Big\}. \tag{1.2}$$



Lasso-type methods can also be applied in the nonparametric regression model
$Y = f(X) + W$, where $f$ is the unknown regression function and $W$ is an error
term. They can be used to create estimates for $f$ that are linear combinations
of basis functions $\phi_1(X), \ldots, \phi_M(X)$ (wavelets, splines,
trigonometric polynomials, etc.). The vectors of linear coefficients are given
by either the $\hat\lambda_{\mathrm{pen}}$ or the $\hat\lambda_{\mathrm{lasso}}$
above, obtained by replacing $Z_{ji}$ by $\phi_j(X_i)$.

In this paper we analyze $\ell_1$-penalized least squares procedures in a more
general framework. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a sample of
independent random pairs distributed as $(X, Y) \in (\mathcal{X}, \mathbb{R})$,
where $\mathcal{X}$ is a Borel subset of $\mathbb{R}^d$; we denote the
probability measure of $X$ by $\mu$. Let $f(X) = E(Y|X)$ be the unknown
regression function and $\mathcal{F}_M = \{f_1, \ldots, f_M\}$ be a finite
dictionary of real-valued functions $f_j$ that are defined on $\mathcal{X}$.
Depending on the statistical targets, the dictionary $\mathcal{F}_M$ can be of
different nature. The main examples are:

(I) a collection $\mathcal{F}_M$ of basis functions used to approximate $f$ in
the nonparametric regression model as discussed above; these functions need
not be orthonormal;

(II) a vector of $M$ one-dimensional random variables
$Z = (f_1(X), \ldots, f_M(X))$ as in linear regression;

(III) a collection $\mathcal{F}_M$ of $M$ arbitrary estimators of $f$.

Case (III) corresponds to the aggregation problem: the estimates can arise, for 
instance, from M different methods; they can also correspond to M different 
values of the tuning parameter of the same method; or they can be computed on 
M different data sets generated from the distribution of $(X, Y)$. Without much
loss of generality, we treat these estimates $f_j$ as fixed functions; otherwise one
can regard our results conditionally on the data set on which they have been 
obtained. 

Within this framework, we use a data dependent $\ell_1$-penalty that differs
from the one described in (1.2) in that the tuning parameter $\gamma$ changes
with $j$ as in [3, 6]. Formally, for any $\lambda = (\lambda_1, \ldots,
\lambda_M) \in \mathbb{R}^M$ define $f_\lambda(x) = \sum_{j=1}^M \lambda_j
f_j(x)$. Then the penalized least squares estimator of $\lambda$ is

$$\hat\lambda = \arg\min_{\lambda \in \mathbb{R}^M} \Big\{ \frac{1}{n} \sum_{i=1}^n \{Y_i - f_\lambda(X_i)\}^2 + \mathrm{pen}(\lambda) \Big\}, \tag{1.3}$$

where

$$\mathrm{pen}(\lambda) = 2 \sum_{j=1}^M \omega_{n,j} |\lambda_j| \quad \text{with} \quad \omega_{n,j} = r_{n,M} \|f_j\|_n, \tag{1.4}$$

where we write $\|g\|_n^2 = n^{-1} \sum_{i=1}^n g^2(X_i)$ for the squared
empirical $L_2$ norm of any function $g : \mathcal{X} \to \mathbb{R}$. The
corresponding estimate of $f$ is $\hat f = \sum_{j=1}^M \hat\lambda_j f_j$. The
choice of the tuning sequence $r_{n,M} > 0$ will be discussed in Section 2.
Following the terminology used in the machine learning literature (see, e.g.,
[21]) we call $\hat f$ the aggregate and the optimization procedure
$\ell_1$-aggregation.

An attractive feature of $\ell_1$-aggregation is computational feasibility.
Because the criterion in (1.3) is convex in $\lambda$, we can use a convex
optimization procedure to compute $\hat\lambda$. We refer to [ini for detailed
analyses of these optimization problems and fast algorithms.
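
As an illustration of this convexity, the following is a minimal sketch of one
generic way to minimize the criterion in (1.3) with the weights in (1.4),
namely proximal gradient descent (ISTA). The solver choice, the function names
`l1_aggregate` and `criterion`, and the simulated data are assumptions made for
this example only; they are not taken from the cited references.

```python
import numpy as np


def soft_threshold(z, t):
    """Componentwise soft thresholding: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def criterion(F, Y, lam, r):
    """Penalized criterion (1.3) with the data dependent weights (1.4)."""
    omega = r * np.sqrt(np.mean(F ** 2, axis=0))  # omega_{n,j} = r * ||f_j||_n
    return np.mean((Y - F @ lam) ** 2) + 2.0 * np.sum(omega * np.abs(lam))


def l1_aggregate(F, Y, r, n_iter=3000):
    """Proximal gradient (ISTA) sketch for minimizing (1.3).

    F is the n x M matrix with entries f_j(X_i). This is an illustrative
    solver, not the algorithm of any particular reference.
    """
    n, M = F.shape
    omega = r * np.sqrt(np.mean(F ** 2, axis=0))
    # Lipschitz constant of the gradient of the least squares term.
    lipschitz = 2.0 * np.linalg.norm(F, 2) ** 2 / n
    step = 1.0 / lipschitz
    lam = np.zeros(M)
    for _ in range(n_iter):
        grad = 2.0 * F.T @ (F @ lam - Y) / n
        lam = soft_threshold(lam - step * grad, 2.0 * step * omega)
    return lam


# Simulated example (hypothetical data): sparse truth, Gaussian dictionary.
rng = np.random.default_rng(0)
n, M = 200, 20
F = rng.normal(size=(n, M))
lam_true = np.zeros(M)
lam_true[:3] = [2.0, -3.0, 1.5]
Y = F @ lam_true + 0.5 * rng.normal(size=n)
r = 2.0 * np.sqrt(np.log(M) / n)
lam_hat = l1_aggregate(F, Y, r)
print("non-zero coordinates:", np.flatnonzero(lam_hat))
```

The soft-thresholding step is what can set coordinates of the iterate exactly
to zero, which is the subset selection feature discussed later in this section.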

Whereas the literature on efficient algorithms is growing very fast, the one 
on the theoretical aspects of the estimates is still emerging. Most of the existing 
theoretical results have been derived in the particular cases of either linear or 
nonparametric regression. 

In the linear parametric regression model most results are asymptotic. We
refer to [16] for the asymptotic distribution of $\hat\lambda_{\mathrm{pen}}$ in
deterministic design regression, when $M$ is fixed and $n \to \infty$. In the
same framework, [28, 29] state conditions for subset selection consistency of
$\hat\lambda_{\mathrm{pen}}$. For random design Gaussian regression, with
$M = M(n)$ possibly larger than $n$, we refer to [20] for consistent variable
selection, based on $\hat\lambda_{\mathrm{pen}}$. For similar assumptions on $M$
and $n$, but for random pairs $(Y_i, Z_i)$ that do not necessarily satisfy the
linear model assumption, we refer to [12] for the consistency of the risk of
$\hat\lambda_{\mathrm{lasso}}$.

The Lasso-type methods have also been extensively used in fixed design
nonparametric regression. When the design matrix $\sum_{i=1}^n Z_i Z_i'$ is the
identity matrix, (1.2) leads to soft thresholding. For soft thresholding in the
case of Gaussian errors, the literature dates back to [^. We refer to [5] for
bibliography in the intermediate years and for a discussion of the connections
between Lasso-type and thresholding methods, with emphasis on estimation within
wavelet bases. For general bases, further results and bibliography we refer to
[T^]. Under the proper choice of $\gamma$, optimal rates of convergence over
Besov spaces, up to logarithmic factors, are obtained. These results apply to
the models where the functions $f_j$ are orthonormal with respect to the scalar
product induced by the empirical norm. For possible departures from the
orthonormality assumption we refer to [5, 6]. These two papers establish finite
sample oracle inequalities for the empirical error $\|\hat f - f\|_n^2$ and for
the $\ell_1$-loss $|\hat\lambda - \lambda|_1$.
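
The link between (1.2) and soft thresholding can be checked numerically. The
sketch below assumes a design whose columns are orthonormal, so that
$\sum_i Z_i Z_i'$ is the identity; with the $1/n$ normalization used in (1.2)
the threshold then works out to $n\gamma/2$. The simulated data and the helper
names are assumptions of this example.

```python
import numpy as np


def soft_threshold(z, t):
    """Componentwise soft thresholding: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def penalized_objective(Z, Y, lam, gamma):
    """Criterion in (1.2): mean squared residual plus gamma * |lam|_1."""
    return np.mean((Y - Z @ lam) ** 2) + gamma * np.sum(np.abs(lam))


rng = np.random.default_rng(1)
n, M, gamma = 50, 5, 0.02
# Z has orthonormal columns, so sum_i Z_i Z_i' = Z'Z is the identity.
Z, _ = np.linalg.qr(rng.normal(size=(n, M)))
Y = Z @ np.array([1.0, 0.0, -0.5, 0.0, 0.2]) + 0.1 * rng.normal(size=n)

# With Z'Z = I the criterion separates over coordinates, and each coordinate
# is minimized by soft thresholding Z_j'Y at level n * gamma / 2.
lam_pen = soft_threshold(Z.T @ Y, n * gamma / 2.0)
print(lam_pen)
```

Coordinates whose correlation with the response falls below the threshold are
set exactly to zero, while the remaining ones are shrunk towards zero.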

Lasso-type estimators in random design nonparametric regression have received
very little attention. First results on this subject seem to be [14, 21]. In the
aggregation framework described above they established oracle inequalities on
the mean risk of $\hat f$, for $\hat\lambda_{\mathrm{lasso}}$ corresponding to
$T = 1$ and when $M$ can be larger than $n$. However, this gives an
approximation of the oracle risk with the slow rate $\sqrt{(\log M)/n}$, which
cannot be improved if $\hat\lambda_{\mathrm{lasso}}$ with fixed $T$ is
considered [14, 21]. Oracle inequalities for the empirical error
$\|\hat f - f\|_n^2$ and for the $\ell_1$-loss $|\hat\lambda - \lambda|_1$ with
faster rates are obtained for $\hat\lambda = \hat\lambda_{\mathrm{pen}}$ in [6]
but they are operational only when $M < \sqrt{n}$. The paper [15] studies
somewhat different estimators involving the $\ell_1$-norms of the coefficients.
For a specific choice of basis functions $f_j$ and with $M < \sqrt{n}$ it proves
optimal (up to logarithmic factor) rates of convergence of $\hat f$ on the Besov
classes without establishing oracle inequalities. Finally we mention the papers
[T71 [THl [57] that analyze in the same spirit as we do below the sparsity issue
for estimators that differ from $\hat\lambda_{\mathrm{pen}}$ in that the
goodness-of-fit term in the minimized criterion cannot be the residual sum of
squares.

In the present paper we extend the results of [6] in several ways; in
particular, we cover sizes $M$ of the dictionary that can be larger than $n$. To
our knowledge, theoretical results for $\hat\lambda_{\mathrm{pen}}$ and the
corresponding $\hat f$ when $M$ can be larger than $n$ have not been established
for random design in either nonparametric regression or aggregation frameworks.
Our considerations are related to a remarkable feature of the
$\ell_1$-aggregation: $\hat\lambda_{\mathrm{pen}}$, for an appropriate choice of
the tuning sequence $r_{n,M}$, has components exactly equal to zero, thereby
realizing subset selection. In contrast, for penalties proportional to
$\sum_{j=1}^M |\lambda_j|^\alpha$, $\alpha > 1$, no estimated coefficients will
be set to zero in finite samples; see, e.g., [22] for a discussion. The purpose
of this paper is to investigate and quantify when $\ell_1$-aggregation can be
used as a dimension reduction technique. We address this by answering the
following two questions: "When does $\hat\lambda \in \mathbb{R}^M$, the
minimizer of (1.3), behave like an estimate in a dimension that is possibly much
lower than $M$?" and "When does the aggregate $\hat f$ behave like a linear
approximation of $f$ by a smaller number of functions?" We make these questions
precise in the following subsection.

1.2. Sparsity and dimension reduction: specific targets 

We begin by introducing the following notation. Let

$$M(\lambda) = \sum_{j=1}^M I_{\{\lambda_j \ne 0\}} = \mathrm{Card}\, J(\lambda)$$

denote the number of non-zero coordinates of $\lambda$, where $I_{\{\cdot\}}$
denotes the indicator function, and $J(\lambda) = \{j \in \{1, \ldots, M\} :
\lambda_j \ne 0\}$. The value $M(\lambda)$ characterizes the sparsity of the
vector $\lambda$: the smaller $M(\lambda)$, the "sparser" $\lambda$.
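
For concreteness, the two quantities can be computed as follows. This is a
small sketch; the function names are chosen for this example, and indices are
0-based here whereas the text counts coordinates from 1.

```python
import numpy as np


def support(lam):
    """J(lam): indices of the non-zero coordinates (0-based in this sketch)."""
    return np.flatnonzero(np.asarray(lam))


def sparsity(lam):
    """M(lam) = Card J(lam), the number of non-zero coordinates."""
    return int(np.count_nonzero(np.asarray(lam)))


lam = np.array([0.0, 1.5, 0.0, -2.0, 0.0])
print(support(lam), sparsity(lam))
```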

To motivate and introduce our notion of sparsity we first consider the simple
case of linear regression. The standard assumption used in the literature on
linear models is $E(Y|X) = f(X) = \lambda_0' X$, where $\lambda_0 \in
\mathbb{R}^M$ has non-zero coefficients only for $j \in J(\lambda_0)$. Clearly,
the $\ell_1$-norm $|\hat\lambda_{\mathrm{OLS}} - \lambda_0|_1$ is of order
$M/\sqrt{n}$, in probability, if $\hat\lambda_{\mathrm{OLS}}$ is the ordinary
least squares estimator of $\lambda_0$ based on all $M$ variables. In contrast,
the general results of Theorems 1, 2 and 3 below show that
$|\hat\lambda - \lambda_0|_1$ is bounded, up to known constants and logarithms,
by $M(\lambda_0)/\sqrt{n}$, for $\hat\lambda$ given by (1.3), if in the penalty
term (1.4) we take $r_{n,M} = A\sqrt{(\log M)/n}$. This means that the estimator
$\hat\lambda$ of the parameter $\lambda_0$ adapts to the sparsity of the
problem: its estimation error is smaller when the vector $\lambda_0$ is sparser.
In other words, we reduce the effective dimension of the problem from $M$ to
$M(\lambda_0)$ without any prior knowledge about the set $J(\lambda_0)$ or the
value $M(\lambda_0)$. The improvement is particularly important if
$M(\lambda_0) \ll M$.

Since in general $f$ cannot be represented exactly by a linear combination of
the given elements $f_j$ we introduce two ways in which $f$ can be close to such
a linear combination. The first one expresses the belief that, for some
$\lambda^* \in \mathbb{R}^M$, the squared distance from $f$ to $f_{\lambda^*}$
can be controlled, up to logarithmic factors, by $M(\lambda^*)/n$. We call this
"weak sparsity". The second one does not involve $M(\lambda^*)$ and states that,
for some $\lambda^* \in \mathbb{R}^M$, the squared distance from $f$ to
$f_{\lambda^*}$ can be controlled, up to logarithmic factors, by $n^{-1/2}$. We
call this "weak approximation".

We now define weak sparsity. Let $C_f > 0$ be a constant depending only on $f$
and

$$\Lambda = \{\lambda \in \mathbb{R}^M : \|f_\lambda - f\|^2 \le C_f r_{n,M}^2 M(\lambda)\} \tag{1.5}$$

which we refer to as the oracle set $\Lambda$. Here and later we denote by
$\|\cdot\|$ the $L_2(\mu)$-norm:

$$\|g\|^2 = \int_{\mathcal{X}} g^2(x)\, \mu(dx)$$

and by $\langle f, g \rangle$ the corresponding scalar product, for any
$f, g \in L_2(\mu)$.

If $\Lambda$ is non-empty, we say that $f$ has the weak sparsity property
relative to the dictionary $\{f_1, \ldots, f_M\}$. We do not need $\Lambda$ to
be a large set: $\mathrm{card}(\Lambda) = 1$ would suffice. In fact, under the
weak sparsity assumption, our targets are $\lambda^*$ and $f^* = f_{\lambda^*}$,
with

$$\lambda^* = \arg\min \{\|f_\lambda - f\| : \lambda \in \mathbb{R}^M,\ M(\lambda) = k^*\}$$

where

$$k^* = \min\{M(\lambda) : \lambda \in \Lambda\}$$

is the effective or oracle dimension. All three quantities, $\lambda^*$, $f^*$
and $k^*$, can be considered as oracles. Weak sparsity can be viewed as a milder
version of the strong sparsity (or simply sparsity) property which commonly
means that $f$ admits the exact representation $f = f_{\lambda_0}$ for some
$\lambda_0 \in \mathbb{R}^M$ with hopefully small $M(\lambda_0)$.

To illustrate the definition of weak sparsity, we consider the framework (I).
Then $\|f_\lambda - f\|$ is the approximation error relative to $f_\lambda$
which can be viewed as a "bias term". For many traditional bases $\{f_j\}$ there
exist vectors $\lambda$ with the first $M(\lambda)$ non-zero coefficients and
other coefficients zero, such that $\|f_\lambda - f\| \le C (M(\lambda))^{-s}$
for some constant $C > 0$, provided that $f$ is a smooth function with $s$
bounded derivatives. The corresponding variance term is typically of the order
$M(\lambda)/n$, so that if $r_{n,M} \sim n^{-1/2}$ the relation
$\|f_\lambda - f\|^2 \sim r_{n,M}^2 M(\lambda)$ can be viewed as the
bias-variance balance realized for $M(\lambda) \sim n^{1/(2s+1)}$. We will need
to choose $r_{n,M}$ slightly larger,

$$r_{n,M} \sim \sqrt{\frac{\log M}{n}},$$

but this does not essentially affect the interpretation of $\Lambda$. In this
example, the fact that $\Lambda$ is non-void means that there exists
$\lambda \in \mathbb{R}^M$ that approximately (up to logarithms) realizes the
bias-variance balance or at least undersmoothes $f$ (indeed, we have only an
inequality between squared bias and variance in the definition of $\Lambda$).
Note that, in general, for instance if $f$ is not smooth, the bias-variance
balance can be realized on very bad, even inconsistent, estimators.
We define now another oracle set

$$\Lambda' = \{\lambda \in \mathbb{R}^M : \|f_\lambda - f\|^2 \le C_f' r_{n,M}\}.$$

If $\Lambda'$ is non-empty, we say that $f$ has the weak approximation property
relative to the dictionary $\{f_1, \ldots, f_M\}$. For instance, in the
framework (III) related to aggregation $\Lambda'$ is non-empty if we consider
functions $f$ that admit $n^{-1/4}$-consistent estimators in the set of linear
combinations $f_\lambda$, for example, if at least one of the $f_j$'s is
$n^{-1/4}$-consistent. This is a modest rate, and such an assumption is quite
natural if we work with standard regression estimators $f_j$ and functions $f$
that are not extremely non-smooth.

We will use the notion of weak approximation only in the mutual coherence
setting that allows for mild correlation among the $f_j$'s and is considered in
Section 2.2 below. Standard assumptions that make our finite sample results
work in the asymptotic setting, when $n \to \infty$ and $M \to \infty$, are:

$$r_{n,M} = A \sqrt{\frac{\log M}{n}}$$

for some sufficiently large $A$ and

$$M(\lambda) \le A' \sqrt{\frac{n}{\log M}}$$

for some sufficiently small $A'$, in which case all $\lambda \in \Lambda$
satisfy

$$\|f_\lambda - f\|^2 \le C_f' r_{n,M}$$

for some constant $C_f' > 0$ depending only on $f$, and weak approximation
follows from weak sparsity. However, in general, $r_{n,M}$ and
$C_f r_{n,M}^2 M(\lambda)$ are not comparable. So it is not true that weak
sparsity implies weak approximation or vice versa. In particular,
$C_f r_{n,M}^2 M(\lambda) \le r_{n,M}$ only if $M(\lambda)$ is smaller in order
than $\sqrt{n/\log(M)}$, for our choice of $r_{n,M}$.



1.3. General assumptions 



We begin by listing and commenting on the assumptions used throughout the
paper.
The first assumption refers to the error terms $W_i = Y_i - f(X_i)$. We recall
that $f(X) = E(Y|X)$.



Assumption (A1). The random variables $X_1, \ldots, X_n$ are independent,
identically distributed random variables with probability measure $\mu$. The
random variables $W_i$ are independently distributed with

$$E\{W_i \mid X_1, \ldots, X_n\} = 0$$

and

$$E\{\exp(|W_i|) \mid X_1, \ldots, X_n\} \le b \quad \text{for some finite } b > 0 \text{ and } i = 1, \ldots, n.$$

We also impose mild conditions on $f$ and on the functions $f_j$. Let
$\|g\|_\infty = \sup_{x \in \mathcal{X}} |g(x)|$ for any bounded function $g$ on
$\mathcal{X}$.

Assumption (A2). (a) There exists $0 < L < \infty$ such that
$\|f_j\|_\infty \le L$ for all $1 \le j \le M$.

(b) There exists $c_0 > 0$ such that $\|f_j\| \ge c_0$ for all
$1 \le j \le M$.

(c) There exists $L_0 < \infty$ such that $E[f_i^2(X) f_j^2(X)] \le L_0$ for all
$1 \le i, j \le M$.

(d) There exists $L_* < \infty$ such that $\|f\|_\infty \le L_* < \infty$.

Remark 1. We note that (a) trivially implies (c). However, as the implied bound
may be too large, we opted for stating (c) separately. Note also that (a) and
(d) imply the following: for any fixed $\lambda \in \mathbb{R}^M$, there exists
a positive constant $L(\lambda)$, depending on $\lambda$, such that
$\|f - f_\lambda\|_\infty = L(\lambda)$.

2. Sparsity oracle inequalities 

In this section we state our results. They have the form of sparsity oracle
inequalities that involve the value $M(\lambda)$ in the bounds for the risk of
the estimators. All the theorems are valid for arbitrary fixed $n \ge 1$,
$M \ge 2$ and $r_{n,M} > 0$.

2.1. Weak sparsity and positive definite inner product matrix 

The further analysis of the $\ell_1$-aggregate depends crucially on the behavior
of the $M \times M$ matrix $\Psi_M$ given by

$$\Psi_M = \big( \langle f_j, f_{j'} \rangle \big)_{1 \le j, j' \le M}.$$

In this subsection we consider the following assumption.

Assumption (A3). For any $M \ge 2$ there exist constants $\kappa_M > 0$ such
that

$$\Psi_M - \kappa_M\, \mathrm{diag}(\Psi_M)$$

is positive semi-definite.

Note that $0 < \kappa_M \le 1$. We will always use Assumption (A3) coupled with
(A2). Clearly, Assumption (A3) and part (b) of (A2) imply that the matrix
$\Psi_M$ is positive definite, with the minimal eigenvalue $\tau$ bounded from
below by $c_0 \kappa_M$. Nevertheless, we prefer to state both assumptions
separately, because this allows us to make more transparent the role of the
(potentially small) constants $c_0$ and $\kappa_M$ in the bounds, rather than
working with $\tau$ which can be as small as their product.
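
Assumption (A3) can be examined numerically for a given inner product matrix.
The sketch below rests on the elementary observation that, writing $D$ for the
diagonal part, the matrix minus $\kappa D$ is positive semi-definite exactly
when $\kappa$ does not exceed the smallest eigenvalue of the rescaled matrix
$D^{-1/2} \Psi D^{-1/2}$. The function name and the example matrix are
assumptions of this illustration.

```python
import numpy as np


def largest_kappa(Psi):
    """Largest kappa such that Psi - kappa * diag(Psi) is positive semi-definite.

    Assumes Psi is symmetric with strictly positive diagonal entries.
    A non-positive return value means that (A3) fails for this matrix.
    """
    d = np.sqrt(np.diag(Psi))
    R = Psi / np.outer(d, d)  # D^{-1/2} Psi D^{-1/2}, which has unit diagonal
    return float(np.linalg.eigvalsh(R)[0])  # eigvalsh sorts in ascending order


# Hypothetical 2 x 2 inner product matrix with correlation 0.5.
Psi = np.array([[4.0, 1.0],
                [1.0, 1.0]])
kappa = largest_kappa(Psi)
print(kappa)
```

Because the rescaled matrix has unit diagonal, its smallest eigenvalue cannot
exceed 1, which matches the remark that $\kappa_M$ is at most 1.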

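Theorem 2.1. Assume that (A1)-(A3) hold. Then, for all $\lambda \in \Lambda$ we
have

$$P\big\{ \|\hat f - f\|^2 \le B_1 \kappa_M^{-1} r_{n,M}^2 M(\lambda) \big\} \ge 1 - \pi_{n,M}(\lambda)$$

and

$$P\big\{ |\hat\lambda - \lambda|_1 \le B_2 \kappa_M^{-1} r_{n,M} M(\lambda) \big\} \ge 1 - \pi_{n,M}(\lambda)$$

where $B_1 > 0$ and $B_2 > 0$ are constants depending on $c_0$ and $C_f$ only
and

$$\pi_{n,M}(\lambda) \le 10 M^2 \exp\Big( -c_1 n \min\Big\{ r_{n,M}^2,\ \frac{r_{n,M}}{L},\ \frac{1}{L^2},\ \frac{\kappa_M^2}{L_0 M^2(\lambda)},\ \frac{\kappa_M}{L^2 M(\lambda)} \Big\} \Big) + \exp\Big( -c_2 \frac{M(\lambda)}{L^2(\lambda)}\, n r_{n,M}^2 \Big),$$

for some positive constants $c_1, c_2$ depending on $c_0$, $C_f$ and $b$ only
and $L(\lambda) = \|f - f_\lambda\|_\infty$.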

Since we favored readable results and proofs over optimal constants, not too
much attention should be paid to the values of the constants involved. More
details about the constants can be found in Section 4.

The most interesting case of Theorem 2.1 corresponds to $\lambda = \lambda^*$
and $M(\lambda) = M(\lambda^*) = k^*$. In view of Assumption (A2) we also have a
rough bound $L(\lambda^*) \le L_* + L|\lambda^*|_1$ which can be further
improved in several important examples, so that $M(\lambda^*)$ and not
$|\lambda^*|_1$ will be involved (cf. Section 3).

2.2. Weak sparsity and mutual coherence 

The results of the previous subsection hold uniformly over $\lambda \in
\Lambda$, when the approximating functions satisfy assumption (A3). We recall
that implicit in the definition of $\Lambda$ is the fact that $f$ is well
approximated by a smaller number of the given functions $f_j$. Assumption (A3)
on the matrix $\Psi_M$ is, however, independent of $f$.

A refinement of our sparsity results can be obtained for $\lambda$ in a set
$\Lambda_1$ that combines the requirements for $\Lambda$, while replacing (A3)
by a condition on $\Psi_M$ that also depends on $M(\lambda)$. Following the
terminology of [8], we consider now matrices $\Psi_M$ with mutual coherence
property. We will assume that the correlation

$$\rho_M(i, j) = \frac{\langle f_i, f_j \rangle}{\|f_i\| \|f_j\|}$$

between elements $i \ne j$ is relatively small, for $i \in J(\lambda)$. Our
condition is somewhat weaker than the mutual coherence property defined in [5]
where all the correlations for $i \ne j$ are supposed to be small. In our
setting the correlations $\rho_M(i, j)$ with $i, j \notin J(\lambda)$ can be
arbitrarily close to 1 or to $-1$. Note that such $\rho_M(i, j)$ constitute the
overwhelming majority of the elements of the correlation matrix if $J(\lambda)$
is a set of small cardinality: $M(\lambda) \ll M$.
Set

$$\rho(\lambda) = \max_{i \in J(\lambda)} \max_{j \ne i} |\rho_M(i, j)|.$$

With $\Lambda$ given by (1.5) define

$$\Lambda_1 = \{\lambda \in \Lambda : \rho(\lambda) M(\lambda) \le 1/45\}. \tag{2.1}$$
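
To make the coherence condition concrete, here is a small sketch that computes
the maximal correlation attached to the support of a vector and checks the
bound 1/45. It only tests the coherence part of (2.1), not membership in the
oracle set; the function names and the example matrix are assumptions of this
illustration.

```python
import numpy as np


def rho(Psi, lam):
    """Max over i in J(lam) and j != i of |rho_M(i, j)| for the matrix Psi."""
    d = np.sqrt(np.diag(Psi))
    corr = np.abs(Psi / np.outer(d, d))
    np.fill_diagonal(corr, 0.0)  # exclude the pairs with j == i
    J = np.flatnonzero(lam)
    return float(corr[J, :].max()) if J.size else 0.0


def satisfies_coherence(Psi, lam, bound=1.0 / 45.0):
    """Checks only the coherence requirement rho(lam) * M(lam) <= 1/45."""
    return bool(rho(Psi, lam) * np.count_nonzero(lam) <= bound)


# Hypothetical inner product matrix: f_2 and f_3 are strongly correlated,
# while f_1 is nearly orthogonal to both.
Psi = np.array([[1.00, 0.01, 0.01],
                [0.01, 1.00, 0.90],
                [0.01, 0.90, 1.00]])
print(satisfies_coherence(Psi, np.array([2.0, 0.0, 0.0])))
print(satisfies_coherence(Psi, np.array([0.0, 1.0, 0.0])))
```

The first vector is supported on an element that is weakly correlated with all
others, so the strong correlation between the two elements outside its support
does not matter; this is the sense in which the condition is weaker than
requiring all correlations to be small.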

Theorem 2.2. Assume that (A1) and (A2) hold. Then, for all $\lambda \in
\Lambda_1$ we have, with probability at least $1 - \tilde\pi_{n,M}(\lambda)$,

$$\|\hat f - f\|^2 \le C r_{n,M}^2 M(\lambda)$$

and

$$|\hat\lambda - \lambda|_1 \le C r_{n,M} M(\lambda),$$

where $C > 0$ is a constant depending only on $c_0$ and $C_f$, and
$\tilde\pi_{n,M}(\lambda)$ is defined as $\pi_{n,M}(\lambda)$ in Theorem 2.1
with $\kappa_M = 1$.

Note that in Theorem 2.2 we do not assume positive definiteness of the matrix
$\Psi_M$. However, it is not hard to see that the condition
$\rho(\lambda) M(\lambda) \le 1/45$ implies positive definiteness of the "small"
$M(\lambda) \times M(\lambda)$-dimensional submatrix
$(\langle f_i, f_j \rangle)_{i, j \in J(\lambda)}$ of $\Psi_M$.

The numerical constant 1/45 is not optimal. It can be multiplied at least by
a factor close to 4 by taking constant factors close to 1 in the definition of
the set $E_2$ in Section 4. The price to pay is a smaller value of constant
$c_1$ in the probability $\tilde\pi_{n,M}(\lambda)$.

2.3. Weak approximation and mutual coherence 

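For $\Lambda'$ given in the Introduction, define

$$\Lambda_2 = \{\lambda \in \Lambda' : \rho(\lambda) M(\lambda) \le 1/45\}. \tag{2.2}$$

Theorem 2.3. Assume that (A1) and (A2) hold. Then, for all $\lambda \in
\Lambda_2$, we have

$$P\Big[ \|\hat f - f\|^2 + r_{n,M} |\hat\lambda - \lambda|_1 \le C' \big\{ \|f_\lambda - f\|^2 + r_{n,M}^2 M(\lambda) \big\} \Big] \ge 1 - \pi'_{n,M}(\lambda)$$

where $C' > 0$ is a constant depending only on $c_0$ and $C_f'$ and

$$\pi'_{n,M}(\lambda) \le 14 M^2 \exp\Big( -c_1' n \min\Big\{ \frac{r_{n,M}^2}{L_0},\ \frac{r_{n,M}}{L^2},\ \frac{1}{L_0 M^2(\lambda)},\ \frac{1}{L^2 M(\lambda)} \Big\} \Big) + \exp\Big( -c_2' \frac{M(\lambda)}{L^2(\lambda)}\, n r_{n,M}^2 \Big)$$

for some constants $c_1', c_2'$ depending on $c_0$, $C_f'$ and $b$ only.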
Theorems 2.1-2.3 are non-asymptotic results valid for any $r_{n,M} > 0$. If we
study asymptotics when $n \to \infty$ or both $n$ and $M$ tend to $\infty$, the
optimal choice of $r_{n,M}$ becomes a meaningful question. It is desirable to
choose the smallest $r_{n,M}$ such that the probabilities $\pi_{n,M}$,
$\tilde\pi_{n,M}$, $\pi'_{n,M}$ tend to 0 (or tend to 0 at a given rate if such
a rate is specified in advance). A typical application is in the case where
$n \to \infty$, $M = M_n \to \infty$, $\kappa_M$ (when using Theorem 2.1),
$L_0$, $L$, $L(\lambda^*)$ are independent of $n$ and $M$, and

$$\frac{n}{M^2(\lambda^*) \log M} \to \infty, \quad \text{as } n \to \infty. \tag{2.3}$$

In this case the probabilities $\pi_{n,M}$, $\tilde\pi_{n,M}$, $\pi'_{n,M}$ tend
to 0 as $n \to \infty$ if we choose

$$r_{n,M} = A \sqrt{\frac{\log M}{n}}$$

for some sufficiently large $A > 0$. Condition (2.3) is rather mild. It implies,
however, that $M$ cannot grow faster than an exponent of $n$ and that
$M(\lambda^*) = o(\sqrt{n})$.



3. Examples 

3.1. High- dimensional linear regression 

The simplest example of application of our results is in linear parametric
regression where the number of covariates $M$ can be much larger than the sample
size $n$. In our notation, linear regression corresponds to the case where there
exists $\lambda^* \in \mathbb{R}^M$ such that $f = f_{\lambda^*}$. Then the weak
sparsity and the weak approximation assumptions hold in an obvious way with
$C_f = C_f' = 0$, whereas $L(\lambda^*) = 0$, so that we easily get the
following corollary of Theorems 2.1 and 2.2.

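Corollary 1. Let $f = f_{\lambda^*}$ for some $\lambda^* \in \mathbb{R}^M$.
Assume that (A1) and items (a)-(c) of (A2) hold.

(i) If (A3) is satisfied, then

$$P\big\{ (\hat\lambda - \lambda^*)' \Psi_M (\hat\lambda - \lambda^*) \le B_1 \kappa_M^{-1} r_{n,M}^2 M(\lambda^*) \big\} \ge 1 - \pi^*_{n,M} \tag{3.1}$$

and

$$P\big\{ |\hat\lambda - \lambda^*|_1 \le B_2 \kappa_M^{-1} r_{n,M} M(\lambda^*) \big\} \ge 1 - \pi^*_{n,M} \tag{3.2}$$

where $B_1 > 0$ and $B_2 > 0$ are constants depending on $c_0$ only and

$$\pi^*_{n,M} \le 10 M^2 \exp\Big( -c_1 n \min\Big\{ r_{n,M}^2,\ \frac{r_{n,M}}{L},\ \frac{1}{L^2},\ \frac{\kappa_M^2}{L_0 M^2(\lambda^*)},\ \frac{\kappa_M}{L^2 M(\lambda^*)} \Big\} \Big)$$

for a positive constant $c_1$ depending on $c_0$ and $b$ only.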
(ii) If the mutual coherence assumption $\rho(\lambda^*) M(\lambda^*) \le 1/45$
is satisfied, then (3.1) and (3.2) hold with $\kappa_M = 1$ and

$$\pi^*_{n,M} \le 10 M^2 \exp\Big( -c_1 n \min\Big\{ r_{n,M}^2,\ \frac{r_{n,M}}{L},\ \frac{1}{L^2},\ \frac{1}{L_0 M^2(\lambda^*)},\ \frac{1}{L^2 M(\lambda^*)} \Big\} \Big)$$

for a positive constant $c_1$ depending on $c_0$ and $b$ only.

Result (3.2) can be compared to [7] which gives a control on the $\ell_2$ (not
$\ell_1$) deviation between $\hat\lambda$ and $\lambda^*$ in the linear
parametric regression setting when $M$ can be larger than $n$, for a different
estimator than ours. Our analysis is in several aspects more involved than that
in [7] because we treat the regression model with random design and do not
assume that the errors $W_i$ are Gaussian. This is reflected in the structure of
the probabilities $\pi^*_{n,M}$. For the case of Gaussian errors and fixed
design considered in [7], sharper bounds can be obtained (cf. 0).

3.2. Nonparametric regression and orthonormal dictionaries 

Assume that the regression function $f$ belongs to a class of functions
$\mathcal{F}$ described by some smoothness or other regularity conditions
arising in nonparametric estimation. Let $\mathcal{F}_M = \{f_1, \ldots, f_M\}$
be the first $M$ functions of an orthonormal basis $\{f_j\}_{j=1}^\infty$. Then
$\hat f$ is an estimator of $f$ obtained by an expansion w.r.t. this basis with
data dependent coefficients. Previously known methods of obtaining reasonable
estimators of such type for regression with random design mainly have the form
of least squares procedures on $\mathcal{F}$ or on a suitable sieve (these
methods are not adaptive since $\mathcal{F}$ should be known) or two-stage
adaptive procedures where on the first stage least squares estimators are
computed on suitable subsets of the dictionary $\mathcal{F}_M$; then, on the
second stage, a subset is selected in a data-dependent way, by minimizing a
penalized criterion with the penalty proportional to the dimension of the
subset. For an overview of these methods in random design regression we refer to
[3], to the book [13] and to more recent papers [4, 15] where some other methods
are suggested. Note that penalizing by the dimension of the subset as discussed
above is not always computationally feasible. In particular, if we need to scan
all the subsets of a huge dictionary, or at least all its subsets of large
enough size, the computational problem becomes NP-hard. In contrast, the
$\ell_1$-penalized procedure that we consider here is computationally feasible.
We cover, for example, the case where $\mathcal{F}$'s are the $L_0(\cdot)$
classes (see below). Results of Section 2 imply that an $\ell_1$-penalized
procedure is adaptive on the scale of such classes. This can be viewed as an
extension to a more realistic random design regression model of Gaussian
sequence space results in [HITT]. However, unlike some results obtained in these
papers, we do not establish sharp asymptotics of the risks.

To give precise statements, assume that the distribution $\mu$ of $X$ admits a
density w.r.t. the Lebesgue measure which is bounded away from zero by
$\mu_{\min} > 0$ and bounded from above by $\mu_{\max} < \infty$. Assume that
$\mathcal{F}_M = \{f_1, \ldots, f_M\}$ is an orthonormal system in
$L_2(\mathcal{X}, dx)$. Clearly, item (b) of Assumption (A2) holds with
$c_0 = \mu_{\min}$, the matrix $\Psi_M$ is positive definite and Assumption (A3)
is satisfied with $\kappa_M$ independent of $n$ and $M$. Therefore, we can apply
Theorem 2.1. Furthermore, Theorem 2.1 remains valid if we replace there
$\|\cdot\|$ by $\|\cdot\|_{\mathrm{Leb}}$ which is the norm in
$L_2(\mathcal{X}, dx)$. In this context, it is convenient to redefine the oracle
$\lambda^*$ in an equivalent form:

$$\lambda^* = \arg\min \{ \|f_\lambda - f\|_{\mathrm{Leb}} : \lambda \in \mathbb{R}^M,\ M(\lambda) = k^* \} \tag{3.3}$$

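with $k^*$ as before. It is straightforward to see that the oracle (3.3) can be
explicitly written as $\lambda^* = (\lambda_1^*, \ldots, \lambda_M^*)$ where
$\lambda_j^* = \langle f_j, f \rangle_{\mathrm{Leb}}$ if
$|\langle f_j, f \rangle_{\mathrm{Leb}}|$ belongs to the set of $k^*$ maximal
values among

$$|\langle f_1, f \rangle_{\mathrm{Leb}}|, \ldots, |\langle f_M, f \rangle_{\mathrm{Leb}}|$$

and $\lambda_j^* = 0$ otherwise. Here $\langle \cdot, \cdot
\rangle_{\mathrm{Leb}}$ is the scalar product induced by the norm
$\|\cdot\|_{\mathrm{Leb}}$. Note also that if $\|f\|_\infty \le L_*$ we have
$L(\lambda^*) = O(M(\lambda^*))$. In fact, $L(\lambda^*) =
\|f - f_{\lambda^*}\|_\infty \le L_* + L|\lambda^*|_1$, whereas

$$|\lambda^*|_1 \le M(\lambda^*) \max_{1 \le j \le M} |\langle f_j, f \rangle_{\mathrm{Leb}}| \le \frac{M(\lambda^*)}{\mu_{\min}} \max_{1 \le j \le M} |\langle f_j, f \rangle| \le \frac{M(\lambda^*) L_* L}{\mu_{\min}}.$$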
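
The explicit form of the oracle in the orthonormal case amounts to keeping the
largest coefficients. A minimal sketch follows, assuming the inner products of
$f$ with the dictionary elements are already available as a vector `theta` and
that the oracle dimension `k` is given; both names are hypothetical, and ties
between equal absolute values are broken by index order here.

```python
import numpy as np


def orthonormal_oracle(theta, k):
    """Keep the k coefficients of largest absolute value and zero the rest."""
    theta = np.asarray(theta, dtype=float)
    keep = np.argsort(-np.abs(theta), kind="stable")[:k]
    lam_star = np.zeros_like(theta)
    lam_star[keep] = theta[keep]
    return lam_star


theta = np.array([0.1, -3.0, 2.0, 0.5])
print(orthonormal_oracle(theta, 2))
```

In an orthonormal system the squared approximation error of a $k$-term
combination is governed by the sum of squares of the discarded coefficients,
which is why retaining the $k$ largest ones is optimal.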

In the remainder of this section we consider the special case where
$\{f_j\}_{j=1}^\infty$ is the Fourier basis in $L_2[0, 1]$ defined by
$f_1(x) = 1$, $f_{2k}(x) = \sqrt{2} \cos(2\pi k x)$,
$f_{2k+1}(x) = \sqrt{2} \sin(2\pi k x)$ for $k = 1, 2, \ldots$, $x \in [0, 1]$,
and we choose $r_{n,M} = A\sqrt{(\log n)/n}$. Set for brevity
$\theta_j = \langle f_j, f \rangle_{\mathrm{Leb}}$ and assume that $f$ belongs
to the class

$$L_0(k) = \{ f : [0, 1] \to \mathbb{R} : \mathrm{Card}\{j : \theta_j \ne 0\} \le k \}$$

where $k$ is an unknown integer.
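
As a quick numerical check of the orthonormality of this Fourier system, the
sketch below evaluates the first few basis functions on an equispaced grid and
forms their Gram matrix. The function name `fourier_basis` and the grid size
are assumptions of this example.

```python
import numpy as np


def fourier_basis(x, M):
    """Evaluate f_1, ..., f_M of the Fourier basis at the points x.

    Returns an array of shape (len(x), M); column j - 1 holds f_j.
    """
    x = np.asarray(x, dtype=float)
    F = np.empty((x.size, M))
    for j in range(1, M + 1):
        k = j // 2  # frequency: j = 2k gives a cosine, j = 2k + 1 a sine
        if j == 1:
            F[:, 0] = 1.0
        elif j % 2 == 0:
            F[:, j - 1] = np.sqrt(2.0) * np.cos(2.0 * np.pi * k * x)
        else:
            F[:, j - 1] = np.sqrt(2.0) * np.sin(2.0 * np.pi * k * x)
    return F


# Midpoint rule on [0, 1]; for these low frequencies and this grid size it
# reproduces the integrals of the products up to rounding error.
N, M = 1000, 7
x = (np.arange(N) + 0.5) / N
F = fourier_basis(x, M)
gram = F.T @ F / N
print(np.round(gram, 6))
```

The Gram matrix should be close to the identity, reflecting orthonormality with
respect to the Lebesgue measure on $[0, 1]$.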



Corollary 2. Let Assumption (A1) and assumptions of this subsection hold.
Let $\gamma < 1/2$ be a given number and $M \le n^s$ for some $s > 0$. Then, for
$r_{n,M} = A\sqrt{(\log n)/n}$ with $A > 0$ large enough, the estimator $\hat f$
satisfies

$$\sup_{f \in L_0(k)} P\Big\{ \|\hat f - f\|^2 \le b_1 A^2 \Big( \frac{k \log n}{n} \Big) \Big\} \ge 1 - n^{-b_2}, \quad \forall\, k \le n^\gamma, \tag{3.4}$$

where $b_1 > 0$ is a constant depending on $\mu_{\min}$ and $\mu_{\max}$ only
and $b_2 > 0$ is a constant depending also on $A$, $\gamma$ and $s$.

Proof of this corollary consists in application of Theorem 2.1 with
$M(\lambda^*) = k$ and $L(\lambda^*) = 0$ where the oracle $\lambda^*$ is
defined in (3.3).

We finally give another corollary of Theorem 2.1 resulting, in particular, in
classical nonparametric rates of convergence, up to logarithmic factors.
Consider the class of functions

$$\mathcal{F} = \Big\{ f : [0, 1] \to \mathbb{R} : \sum_{j=1}^\infty |\theta_j| \le \bar L \Big\} \tag{3.5}$$

where $\bar L > 0$ is a fixed constant. This is a very large class of functions.
It contains, for example, all the periodic Hölderian functions on $[0, 1]$ and
all the Sobolev classes of functions
$\mathcal{F}_\beta = \{ f : [0, 1] \to \mathbb{R} : \sum_{j=1}^\infty j^{2\beta}
\theta_j^2 \le Q \}$ with smoothness index $\beta > 1/2$ and
$Q = Q(\bar L) > 0$.

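Corollary 3. Let Assumption (A1) and assumptions of this subsection hold.
Let $M \le n^s$ for some $s > 0$. Then, for $r_{n,M} = A\sqrt{(\log n)/n}$ with
$A > 0$ large enough, the estimator $\hat f$ satisfies

$$P\Big\{ \|\hat f - f\|^2 \le b_3 A^2 \Big( \frac{\log n}{n} \Big) M(\lambda^*) \Big\} \ge 1 - \pi_n(\lambda^*), \quad \forall\, f \in \mathcal{F}, \tag{3.6}$$

where $\lambda^*$ is defined in (3.3), $b_3 > 0$ is a constant depending on
$\mu_{\min}$ and $\mu_{\max}$ only and

$$\pi_n(\lambda^*) \le n^{-b_4} + M^2 \exp\big( -b_5\, n M^{-2}(\lambda^*) \big)$$

with the constants $b_4 > 0$ and $b_5 > 0$ depending only on $\mu_{\min}$,
$\mu_{\max}$, $A$, $\bar L$ and $s$.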

This corollary implies, in particular, that the estimator $\hat f$ adapts to
unknown smoothness, up to logarithmic factors, simultaneously on the Hölder and
Sobolev classes. In fact, it is not hard to see that, for example, when
$f \in \mathcal{F}_\beta$ with $\beta > 1/2$ we have $M(\lambda^*) \le M_n$
where $M_n \sim (n/\log n)^{1/(2\beta+1)}$. Therefore, Corollary 3 implies that
$\hat f$ converges to $f$ with rate $(n/\log n)^{-2\beta/(2\beta+1)}$, whatever
the value $\beta > 1/2$, thus realizing adaptation to the unknown smoothness
$\beta$. Similar reasoning works for the Hölder classes.



4. Proofs 

4.1. Proof of Theorem 1

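Throughout this proof $\lambda$ is an arbitrary, fixed element of $\Lambda$
given in (1.5). Recall the notation $f_\lambda = \sum_{j=1}^M \lambda_j f_j$. We
begin by proving two lemmas. The first one is an elementary consequence of the
definition of $\hat\lambda$. Define the random variables

$$V_j = \frac{1}{n} \sum_{i=1}^n f_j(X_i) W_i, \quad 1 \le j \le M,$$

and the event

$$E_1 = \bigcap_{j=1}^M \{ 2|V_j| \le \omega_{n,j} \}.$$

Lemma 1. On the event $E_1$, we have for all $n \ge 1$,

$$\|\hat f - f\|_n^2 + \sum_{j=1}^M \omega_{n,j} |\hat\lambda_j - \lambda_j| \le \|f_\lambda - f\|_n^2 + 4 \sum_{j \in J(\lambda)} \omega_{n,j} |\hat\lambda_j - \lambda_j|. \tag{4.1}$$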

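Proof. We begin as in [19]. By definition, $\hat f = f_{\hat\lambda}$ satisfies

$$\hat S(\hat\lambda) + \sum_{j=1}^M 2\omega_{n,j} |\hat\lambda_j| \le \hat S(\lambda) + \sum_{j=1}^M 2\omega_{n,j} |\lambda_j|$$

for all $\lambda \in \mathbb{R}^M$, where $\hat S(\lambda) = n^{-1} \sum_{i=1}^n
\{Y_i - f_\lambda(X_i)\}^2$ denotes the least squares term in (1.3). We may
rewrite this as

$$\|\hat f - f\|_n^2 + \sum_{j=1}^M 2\omega_{n,j} |\hat\lambda_j| \le \|f_\lambda - f\|_n^2 + \sum_{j=1}^M 2\omega_{n,j} |\lambda_j| + \frac{2}{n} \sum_{i=1}^n W_i (\hat f - f_\lambda)(X_i).$$

If $E_1$ holds we have

$$\frac{2}{n} \sum_{i=1}^n W_i (\hat f - f_\lambda)(X_i) = 2 \sum_{j=1}^M V_j (\hat\lambda_j - \lambda_j) \le \sum_{j=1}^M \omega_{n,j} |\hat\lambda_j - \lambda_j|$$

and therefore, still on $E_1$,

$$\|\hat f - f\|_n^2 \le \|f_\lambda - f\|_n^2 + \sum_{j=1}^M \omega_{n,j} |\hat\lambda_j - \lambda_j| + \sum_{j=1}^M 2\omega_{n,j} |\lambda_j| - \sum_{j=1}^M 2\omega_{n,j} |\hat\lambda_j|.$$

Adding the term $\sum_{j=1}^M \omega_{n,j} |\hat\lambda_j - \lambda_j|$ to both
sides of this inequality yields further, on $E_1$,

$$\|\hat f - f\|_n^2 + \sum_{j=1}^M \omega_{n,j} |\hat\lambda_j - \lambda_j| \le \|f_\lambda - f\|_n^2 + 2 \sum_{j=1}^M \omega_{n,j} |\hat\lambda_j - \lambda_j| + \sum_{j=1}^M 2\omega_{n,j} |\lambda_j| - \sum_{j=1}^M 2\omega_{n,j} |\hat\lambda_j|.$$

Recall that $J(\lambda)$ denotes the set of indices of the non-zero elements of
$\lambda$, and that $M(\lambda) = \mathrm{Card}\, J(\lambda)$. Rewriting the
right-hand side of the previous display, we find that, on $E_1$,

$$\|\hat f - f\|_n^2 + \sum_{j=1}^M \omega_{n,j} |\hat\lambda_j - \lambda_j| \le \|f_\lambda - f\|_n^2 + \Big( 2 \sum_{j=1}^M \omega_{n,j} |\hat\lambda_j - \lambda_j| - 2 \sum_{j \notin J(\lambda)} \omega_{n,j} |\hat\lambda_j| \Big) + \Big( 2 \sum_{j \in J(\lambda)} \omega_{n,j} |\lambda_j| - 2 \sum_{j \in J(\lambda)} \omega_{n,j} |\hat\lambda_j| \Big)$$

$$\le \|f_\lambda - f\|_n^2 + 4 \sum_{j \in J(\lambda)} \omega_{n,j} |\hat\lambda_j - \lambda_j|$$

by the triangle inequality and the fact that $\lambda_j = 0$ for
$j \notin J(\lambda)$. □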

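The following lemma is crucial for the proof of Theorem 1.

Lemma 2. Assume that (A1)-(A3) hold. Define the events

$$E_2 = \Big\{ \tfrac{1}{2} \|f_j\|^2 \le \|f_j\|_n^2 \le 2 \|f_j\|^2,\ j = 1, \ldots, M \Big\}$$

and

$$E_3(\lambda) = \big\{ \|f_\lambda - f\|_n^2 \le 2 \|f_\lambda - f\|^2 + r_{n,M}^2 M(\lambda) \big\}.$$

Then, on the set $E_1 \cap E_2 \cap E_3(\lambda)$, we have

$$\|\hat f - f\|_n^2 + \frac{c_0 r_{n,M}}{\sqrt{2}} |\hat\lambda - \lambda|_1 \le 2 \|f_\lambda - f\|^2 + r_{n,M}^2 M(\lambda) + 4 r_{n,M} \sqrt{\frac{2 M(\lambda)}{\kappa_M}}\, \|\hat f - f_\lambda\|. \tag{4.2}$$

Proof. Observe that assumption (A3) implies that, on the set $E_2$,

$$\sum_{j \in J(\lambda)} \omega_{n,j}^2 |\hat\lambda_j - \lambda_j|^2 \le \sum_{j=1}^M \omega_{n,j}^2 |\hat\lambda_j - \lambda_j|^2 \le 2 r_{n,M}^2 (\hat\lambda - \lambda)'\, \mathrm{diag}(\Psi_M)\, (\hat\lambda - \lambda) \le \frac{2 r_{n,M}^2}{\kappa_M} \|\hat f - f_\lambda\|^2.$$

Applying the Cauchy-Schwarz inequality to the last term on the right hand side
of (4.1) and using the inequality above we obtain, on the set $E_1 \cap E_2$,

$$\|\hat f - f\|_n^2 + \sum_{j=1}^M \omega_{n,j} |\hat\lambda_j - \lambda_j| \le \|f_\lambda - f\|_n^2 + 4 r_{n,M} \sqrt{\frac{2 M(\lambda)}{\kappa_M}}\, \|\hat f - f_\lambda\|.$$

Intersect with $E_3(\lambda)$ and use the fact that
$\omega_{n,j} \ge c_0 r_{n,M} / \sqrt{2}$ on $E_2$ to derive the claim. □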
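Proof of Theorem 1. Recall that $\lambda$ is an arbitrary fixed element of
$\Lambda$ given in (1.5). Define the set

$$U(\lambda) = \Big\{ \mu \in \mathbb{R}^M : \|f_\mu\| \ge r_{n,M} \sqrt{M(\lambda)} \Big\} \cap \Big\{ \mu \in \mathbb{R}^M : |\mu|_1 \le \frac{\sqrt{2}}{c_0} \Big( (2 C_f + 1) r_{n,M} M(\lambda) + 4 \sqrt{\frac{2 M(\lambda)}{\kappa_M}}\, \|f_\mu\| \Big) \Big\}$$

and the event

$$E_4(\lambda) = \Big\{ \sup_{\mu \in U(\lambda)} \frac{\|f_\mu\|^2 - \|f_\mu\|_n^2}{\|f_\mu\|^2} \le \frac{1}{2} \Big\}.$$

We prove that the statement of the theorem holds on the event

$$E(\lambda) := E_1 \cap E_2 \cap E_3(\lambda) \cap E_4(\lambda)$$

and we bound $P[\{E(\lambda)\}^c]$ by $\pi_{n,M}(\lambda)$ in Lemmas 4, 5, 6 and
7 below.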

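First we observe that, on $E(\lambda) \cap \{ \|\hat f - f_\lambda\| \le
r_{n,M} \sqrt{M(\lambda)} \}$, we immediately obtain, for each
$\lambda \in \Lambda$,

$$\|\hat f - f\| \le \|f_\lambda - f\| + \|\hat f - f_\lambda\| \le \|f_\lambda - f\| + r_{n,M} \sqrt{M(\lambda)} \le (1 + C_f^{1/2})\, r_{n,M} \sqrt{M(\lambda)} \tag{4.3}$$

since $\|f_\lambda - f\|^2 \le C_f r_{n,M}^2 M(\lambda)$ for
$\lambda \in \Lambda$. Consequently, we find further that, on the same event
$E(\lambda) \cap \{ \|\hat f - f_\lambda\| \le r_{n,M} \sqrt{M(\lambda)} \}$,

$$\|\hat f - f\|^2 \le 2 (1 + C_f) r_{n,M}^2 M(\lambda) =: C_1 r_{n,M}^2 M(\lambda) \le C_1 \frac{r_{n,M}^2 M(\lambda)}{\kappa_M},$$

since $0 < \kappa_M \le 1$. Also, via (4.2) of Lemma 2 above,

$$|\hat\lambda - \lambda|_1 \le \frac{1}{c_0} \big\{ 2\sqrt{2}\, C_f + \sqrt{2} + 8 \big\} \frac{r_{n,M} M(\lambda)}{\sqrt{\kappa_M}} =: C_2 \frac{r_{n,M} M(\lambda)}{\sqrt{\kappa_M}} \le C_2 \frac{r_{n,M} M(\lambda)}{\kappa_M}.$$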


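To finish the proof, we now show that the same conclusions hold on the event
$E(\lambda) \cap \{\|\hat f - f_\lambda\| > r_{n,M}\sqrt{M(\lambda)}\}$. Observe that $\hat\lambda - \lambda \in U(\lambda)$ by Lemma 2.
Consequently,

$$\begin{aligned}
\tfrac12 \|\hat f - f_\lambda\|^2 &\le \|\hat f - f_\lambda\|_n^2 && \text{(by definition of } E_4(\lambda)\text{)} \\
&\le 2\|f - f_\lambda\|_n^2 + 2\|\hat f - f\|_n^2 \\
&\le 4\|f - f_\lambda\|^2 + 2 r_{n,M}^2 M(\lambda) + 2\|\hat f - f\|_n^2 && \text{(by definition of } E_3(\lambda)\text{)} \\
&\le 4\|f - f_\lambda\|^2 + 2 r_{n,M}^2 M(\lambda) + 2\left\{ 2\|f_\lambda - f\|^2 + r_{n,M}^2 M(\lambda) + 4 r_{n,M} \sqrt{\tfrac{2M(\lambda)}{\kappa_M}}\, \|\hat f - f_\lambda\| \right\} && \text{(by Lemma 2)} \\
&\le 8\|f - f_\lambda\|^2 + 4 r_{n,M}^2 M(\lambda) + 4^3 r_{n,M}^2 \frac{2M(\lambda)}{\kappa_M} + \tfrac14 \|\hat f - f_\lambda\|^2,
\end{aligned} \qquad (4.4)$$

using $2xy \le 4x^2 + y^2/4$, with $x = 4 r_{n,M} \sqrt{2M(\lambda)/\kappa_M}$ and $y = \|\hat f - f_\lambda\|$. Hence,
on the event $E(\lambda) \cap \{\|\hat f - f_\lambda\| > r_{n,M}\sqrt{M(\lambda)}\}$, we have that for each $\lambda \in \Lambda$,

$$\|\hat f - f_\lambda\| \le 4 \left\{ \sqrt{2C_f} + 6 \right\} \frac{r_{n,M} \sqrt{M(\lambda)}}{\sqrt{\kappa_M}}. \qquad (4.5)$$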

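This and a reasoning similar to the one used in (4.3) yield

$$\|\hat f - f\|^2 \le \left\{ (1 + 4\sqrt{2}) \sqrt{C_f} + 6 \right\}^2 \kappa_M^{-1} r_{n,M}^2 M(\lambda) =: C_3 \frac{r_{n,M}^2 M(\lambda)}{\kappa_M}.$$

Also, invoking again Lemma 2 in connection with (4.5) we obtain

$$|\hat\lambda - \lambda|_1 \le \frac{\sqrt{2}}{c_0} \left\{ 2C_f + 1 + 32\sqrt{C_f} + 24\sqrt{2} \right\} \frac{r_{n,M} M(\lambda)}{\kappa_M} =: C_4 \frac{r_{n,M} M(\lambda)}{\kappa_M}.$$

Take now $B_1 = C_1 \vee C_3$ and $B_2 = C_2 \vee C_4$ to obtain $\|\hat f - f\|^2 \le B_1 \kappa_M^{-1} r_{n,M}^2 M(\lambda)$
and $|\hat\lambda - \lambda|_1 \le B_2 \kappa_M^{-1} r_{n,M} M(\lambda)$. The conclusion of the theorem follows from
the bounds on the probabilities of the complements of the events $E_1$, $E_2$, $E_3(\lambda)$
and $E_4(\lambda)$ as proved in Lemmas 4, 5, 6 and 7 below. □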



The following results will make repeated use of a version of Bernstein's
inequality which we state here for ease of reference.

Lemma 3 (Bernstein's inequality). Let $\zeta_1, \dots, \zeta_n$ be independent random
variables such that

$$\frac{1}{n} \sum_{i=1}^{n} E|\zeta_i|^m \le \frac{m!}{2}\, w^2 d^{m-2}$$

for some positive constants $w$ and $d$ and for all integers $m \ge 2$. Then, for any
$\varepsilon > 0$ we have

$$P\left\{ \sum_{i=1}^{n} (\zeta_i - E\zeta_i) \ge n\varepsilon \right\} \le \exp\left( -\frac{n\varepsilon^2}{2(w^2 + d\varepsilon)} \right). \qquad (4.6)$$
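
As a quick numerical sanity check of (4.6), the sketch below compares the bound with a Monte Carlo estimate. The example is hypothetical and not taken from the paper: the $\zeta_i$ are taken i.i.d. uniform on $[-1, 1]$, for which $E|\zeta_i|^m \le E\zeta_i^2 = 1/3$, so the moment condition holds with $w^2 = 1/3$ and $d = 1$. The function names and simulation parameters are illustrative assumptions.

```python
import math
import random


def bernstein_bound(n, eps, w2, d):
    """Right-hand side of (4.6): exp(-n eps^2 / (2 (w^2 + d eps)))."""
    return math.exp(-n * eps ** 2 / (2.0 * (w2 + d * eps)))


def empirical_tail(n, eps, trials, seed=0):
    """Monte Carlo estimate of P{sum_i (zeta_i - E zeta_i) >= n eps}
    for zeta_i i.i.d. uniform on [-1, 1], so that E zeta_i = 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = sum(rng.uniform(-1.0, 1.0) for _ in range(n))
        if total >= n * eps:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    n, eps = 200, 0.1
    print("bound    :", bernstein_bound(n, eps, w2=1.0 / 3.0, d=1.0))
    print("empirical:", empirical_tail(n, eps, trials=2000))
```

In this setting the simulated tail frequency should fall well below the bound, which is expected since Bernstein's inequality is not tight for moderate deviations.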



Lemma 4. Assume that (A1) and (A2) hold. Then, for all $n \ge 1$, $M \ge 2$,

$$P(E_2^C) \le 2M \exp\left( -\frac{n c_0^2}{12 L^2} \right). \qquad (4.7)$$

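Proof. The proof follows from a simple application of the union bound and
Bernstein's inequality:

$$\begin{aligned}
P(E_2^C) &\le M \max_{1 \le j \le M} \left( P\left\{ \tfrac12 \|f_j\|^2 > \|f_j\|_n^2 \right\} + P\left\{ \|f_j\|_n^2 > 2\|f_j\|^2 \right\} \right) \\
&\le M \exp\left( -\frac{n c_0^2}{12 L^2} \right) + M \exp\left( -\frac{n c_0^2}{4 L^2} \right),
\end{aligned}$$

where we applied Bernstein's inequality with $w^2 = \|f_j\|^2 L^2$ and $d = L^2$, and
with $\varepsilon = \tfrac12 \|f_j\|^2$ for the first probability and with $\varepsilon = \|f_j\|^2$ for the second
one. □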
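
The two exponents in this proof can be checked directly from (4.6). The worked computation below is a sketch under the assumption that Bernstein's inequality is applied to $\zeta_i = f_j^2(X_i)$ with $w^2 = \|f_j\|^2 L^2$ and $d = L^2$, and that (A2) supplies the lower bound $\|f_j\| \ge c_0$, as the constants in (4.7) suggest.

```latex
\[
\varepsilon = \tfrac12 \|f_j\|^2 : \qquad
\frac{n \varepsilon^2}{2 (w^2 + d \varepsilon)}
  = \frac{n \|f_j\|^4 / 4}{2 \left( L^2 \|f_j\|^2 + L^2 \|f_j\|^2 / 2 \right)}
  = \frac{n \|f_j\|^2}{12 L^2}
  \ge \frac{n c_0^2}{12 L^2},
\]
\[
\varepsilon = \|f_j\|^2 : \qquad
\frac{n \varepsilon^2}{2 (w^2 + d \varepsilon)}
  = \frac{n \|f_j\|^4}{2 \left( L^2 \|f_j\|^2 + L^2 \|f_j\|^2 \right)}
  = \frac{n \|f_j\|^2}{4 L^2}
  \ge \frac{n c_0^2}{4 L^2}.
\]
```

Since $\exp(-n c_0^2 / (4L^2)) \le \exp(-n c_0^2 / (12 L^2))$, the sum of the two terms is at most the right hand side of (4.7).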

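Lemma 5. Let Assumptions (A1) and (A2) hold. Then

$$P\left( \{E_1 \cap E_2\}^C \right) \le 2M \exp\left( -\frac{n r_{n,M}^2}{16 b} \right) + 2M \exp\left( -\frac{n r_{n,M} c_0}{8\sqrt{2}\, L} \right) + 2M \exp\left( -\frac{n c_0^2}{12 L^2} \right).$$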

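Proof. We apply Bernstein's inequality with the variables $\zeta_i = \zeta_{i,j} = -f_j(X_i) W_i$,
for each fixed $j \in \{1, \dots, M\}$ and fixed $X_1, \dots, X_n$. By assumptions (A1) and
(A2) we find that, for $m \ge 2$,

$$\frac{1}{n} \sum_{i=1}^{n} E\left\{ |\zeta_{i,j}|^m \mid X_1, \dots, X_n \right\} \le L^{m-2}\, \frac{1}{n} \sum_{i=1}^{n} |f_j(X_i)|^2\, E\left\{ |W_i|^m \mid X_1, \dots, X_n \right\}.$$

Using (4.6), with $\varepsilon = \omega_{n,j}/2$, $w = \sqrt{b}\, \|f_j\|_n$, $d = L$, the union bound and the
fact that

$$\exp\{-x/(\alpha + \beta)\} \le \exp\{-x/(2\alpha)\} + \exp\{-x/(2\beta)\}, \quad \forall x, \alpha, \beta > 0, \qquad (4.8)$$

we obtain

$$\begin{aligned}
P(E_1^C \mid X_1, \dots, X_n) &\le 2 \sum_{j=1}^{M} \exp\left( -\frac{n r_{n,M}^2 \|f_j\|_n^2 / 4}{2 \left( b \|f_j\|_n^2 + L r_{n,M} \|f_j\|_n / 2 \right)} \right) \\
&\le 2M \exp\left( -\frac{n r_{n,M}^2}{16 b} \right) + 2 \sum_{j=1}^{M} \exp\left( -\frac{n r_{n,M} \|f_j\|_n}{8 L} \right).
\end{aligned}$$

This inequality, together with the fact that on $E_2$ we have $\|f_j\|_n \ge \|f_j\|/\sqrt{2} \ge c_0/\sqrt{2}$,
implies

$$P(E_1^C \cap E_2) \le 2M \exp\left( -\frac{n r_{n,M}^2}{16 b} \right) + 2M \exp\left( -\frac{n r_{n,M} c_0}{8\sqrt{2}\, L} \right).$$

Combining this with Lemma 4 we get the result. □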
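
To see how the three terms of the bound in Lemma 5 behave, the following sketch evaluates it numerically. The tuning sequence $r_{n,M} = A\sqrt{\log M / n}$ and all numerical values ($A$, $b$, $L$, $c_0$) are assumptions made for illustration only; the function names are hypothetical.

```python
import math


def lemma5_bound(n, M, r, b, L, c0):
    """Sum of the three terms bounding P({E_1 and E_2}^C) in Lemma 5."""
    t1 = 2 * M * math.exp(-n * r ** 2 / (16.0 * b))
    t2 = 2 * M * math.exp(-n * r * c0 / (8.0 * math.sqrt(2.0) * L))
    t3 = 2 * M * math.exp(-n * c0 ** 2 / (12.0 * L ** 2))
    return t1 + t2 + t3


def tuning_sequence(n, M, A):
    """Assumed form r = A * sqrt(log M / n); A is an illustrative constant."""
    return A * math.sqrt(math.log(M) / n)


if __name__ == "__main__":
    M, A, b, L, c0 = 50, 8.0, 1.0, 1.0, 1.0
    for n in (100, 1000, 10000):
        r = tuning_sequence(n, M, A)
        print(n, lemma5_bound(n, M, r, b, L, c0))
```

With this choice of $r_{n,M}$ the first term equals $2M \cdot M^{-A^2/(16b)}$ and does not depend on $n$, so it is small only when $A^2 > 16b$; the other two terms decay exponentially in $\sqrt{n}$ and $n$ respectively.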
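Lemma 6. Assume that (A1) and (A2) hold. Then, for all $n \ge 1$, $M \ge 2$,

$$P\left[ \{E_3(\lambda)\}^C \right] \le \exp\left( -\frac{M(\lambda)\, n\, r_{n,M}^2}{4 L^2(\lambda)} \right).$$

Proof. Recall that $\|f_\lambda - f\|_\infty = L(\lambda)$. The claim follows from Bernstein's
inequality applied with $\varepsilon = \|f_\lambda - f\|^2 + r_{n,M}^2 M(\lambda)$, $d = L^2(\lambda)$ and
$w^2 = \|f_\lambda - f\|^2 L^2(\lambda)$. □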
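
Lemma 7. Assume (A1)-(A3). Then

$$P\left[ \{E_4(\lambda)\}^C \right] \le 2M^2 \exp\left( -\frac{n}{16 L_0 C^2 M^2(\lambda)} \right) + 2M^2 \exp\left( -\frac{n}{8 L^2 C M(\lambda)} \right),$$

where $C = 2 c_0^{-2} \left( 2C_f + 1 + 4\sqrt{2/\kappa_M} \right)^2$.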

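Proof. Let

$$\psi_M(i,j) = E[f_i(X) f_j(X)] \quad \text{and} \quad \psi_{n,M}(i,j) = \frac{1}{n} \sum_{k=1}^{n} f_i(X_k) f_j(X_k)$$

denote the $(i,j)$th entries of matrices $\Psi_M$ and $\Psi_{n,M}$, respectively. Define

$$\eta_{n,M} = \max_{1 \le i,j \le M} |\psi_M(i,j) - \psi_{n,M}(i,j)|.$$

Then, for every $\mu \in U(\lambda)$ we have

$$\begin{aligned}
\frac{\left| \|f_\mu\|^2 - \|f_\mu\|_n^2 \right|}{\|f_\mu\|^2} &\le \frac{|\mu|_1^2}{\|f_\mu\|^2}\, \eta_{n,M} \\
&\le \frac{2}{c_0^2 \|f_\mu\|^2} \left\{ (2C_f + 1) r_{n,M} M(\lambda) + 4 \sqrt{\frac{2M(\lambda)}{\kappa_M}}\, \|f_\mu\| \right\}^2 \eta_{n,M} \\
&\le \frac{2}{c_0^2} \left\{ (2C_f + 1) \sqrt{M(\lambda)} + 4 \sqrt{\frac{2M(\lambda)}{\kappa_M}} \right\}^2 \eta_{n,M} \\
&= \frac{2}{c_0^2} \left\{ (2C_f + 1) + 4 \sqrt{\frac{2}{\kappa_M}} \right\}^2 M(\lambda)\, \eta_{n,M}.
\end{aligned}$$

Using the last display and the union bound, we find for each $\lambda \in \Lambda$ that

$$\begin{aligned}
P\left[ \{E_4(\lambda)\}^C \right] &\le P\left[ \eta_{n,M} \ge 1/\{2CM(\lambda)\} \right] \\
&\le 2M^2 \max_{1 \le i,j \le M} P\left[ |\psi_M(i,j) - \psi_{n,M}(i,j)| \ge 1/\{2CM(\lambda)\} \right].
\end{aligned}$$

Now for each $(i,j)$, the value $\psi_M(i,j) - \psi_{n,M}(i,j)$ is a sum of $n$ i.i.d. zero
mean random variables. We can therefore apply Bernstein's inequality with
$\zeta_k = f_i(X_k) f_j(X_k)$, $\varepsilon = 1/\{2CM(\lambda)\}$, $w^2 = L_0$, $d = L^2$ and inequality (4.8) to
obtain the result. □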
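
The quantity $\eta_{n,M}$ is easy to simulate. The sketch below does so for a hypothetical dictionary that is not taken from the paper: $f_j(x) = \sqrt{2}\cos(\pi j x)$, $j = 1, \dots, M$, with $X$ uniform on $[0, 1]$, so that $\Psi_M$ is the identity matrix. The function names and sample sizes are illustrative assumptions.

```python
import math
import random


def basis(j, x):
    """Hypothetical dictionary element f_j(x) = sqrt(2) cos(pi j x), j >= 1.
    These functions are orthonormal in L2 of the uniform law on [0, 1]."""
    return math.sqrt(2.0) * math.cos(math.pi * j * x)


def eta(n, M, seed=0):
    """max over (i, j) of |psi_M(i, j) - psi_{n,M}(i, j)| for one simulated
    sample of size n; psi_M is the identity because the f_j are orthonormal."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    feats = [[basis(j, x) for j in range(1, M + 1)] for x in xs]
    worst = 0.0
    for i in range(M):
        for j in range(M):
            emp = sum(row[i] * row[j] for row in feats) / n
            pop = 1.0 if i == j else 0.0
            worst = max(worst, abs(emp - pop))
    return worst


if __name__ == "__main__":
    for n in (200, 5000):
        print(n, eta(n, M=4))
```

For this dictionary the entries of the empirical Gram matrix deviate from their means at the order $n^{-1/2}$, which is the regime in which the event $\{\eta_{n,M} \ge 1/\{2CM(\lambda)\}\}$ becomes unlikely.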



4.2. Proof of Theorem 2.2



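Let $\lambda$ be an arbitrary fixed element of $\Lambda_1$ given in (2.1). The proof of this
theorem is similar to that of Theorem 2.1. The only difference is that we now
show that the result holds on the event

$$\tilde E(\lambda) := E_1 \cap E_2 \cap E_3(\lambda) \cap \tilde E_4(\lambda).$$

Here the set $\tilde E_4(\lambda)$ is given by

$$\tilde E_4(\lambda) = \left\{ \sup_{\mu \in \tilde U(\lambda)} \frac{\|f_\mu\|^2 - \|f_\mu\|_n^2}{\|f_\mu\|^2} \le \frac12 \right\},$$

where

$$\tilde U(\lambda) = \left\{ \mu \in \mathbb{R}^M : \|f_\mu\| \ge r_{n,M} \sqrt{M(\lambda)} \right\} \cap \left\{ \mu \in \mathbb{R}^M : |\mu|_1 \le \frac{2\sqrt{2}}{c_0} \left( (2C_f + 1) r_{n,M} M(\lambda) + 8 \sqrt{M(\lambda)}\, \|f_\mu\| \right) \right\}.$$

We bounded $P(\{E_1 \cap E_2\}^C)$ and $P[\{E_3(\lambda)\}^C]$ in Lemmas 5 and 6 above.
The bound for $P[\{\tilde E_4(\lambda)\}^C]$ is obtained exactly as in Lemma 7 but now with
$C_1 = 8 c_0^{-2} (2C_f + 9)^2$, so that we have

$$P\left[ \{\tilde E_4(\lambda)\}^C \right] \le 2M^2 \exp\left( -\frac{n}{16 L_0 C_1^2 M^2(\lambda)} \right) + 2M^2 \exp\left( -\frac{n}{8 L^2 C_1 M(\lambda)} \right).$$

The proof of Theorem 2.2 on the set $\tilde E(\lambda) \cap \{\|\hat f - f_\lambda\| \le r_{n,M}\sqrt{M(\lambda)}\}$ is identical
to that of Theorem 2.1 on the set $E(\lambda) \cap \{\|\hat f - f_\lambda\| \le r_{n,M}\sqrt{M(\lambda)}\}$. Next,
on the set $\tilde E(\lambda) \cap \{\|\hat f - f_\lambda\| > r_{n,M}\sqrt{M(\lambda)}\}$, we follow again the argument of
Theorem 2.1, first invoking Lemma 8 given below to argue that $\hat\lambda - \lambda \in \tilde U(\lambda)$
(this lemma plays the same role as Lemma 2 in the proof of Theorem 2.1) and
then reasoning exactly as in (4.4). □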

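Lemma 8. Assume that (A1) and (A2) hold and that $\lambda$ is an arbitrary fixed
element of the set $\{\lambda \in \mathbb{R}^M : \rho(\lambda) M(\lambda) \le 1/45\}$. Then, on the set $E_1 \cap E_2 \cap E_3(\lambda)$,
we have

$$\|\hat f - f\|_n^2 + \frac{c_0 r_{n,M}}{2\sqrt{2}}\, |\hat\lambda - \lambda|_1 \le 2\|f_\lambda - f\|^2 + r_{n,M}^2 M(\lambda) + 8 r_{n,M} \sqrt{M(\lambda)}\, \|\hat f - f_\lambda\|. \qquad (4.9)$$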



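Proof. Set for brevity

$$\rho = \rho(\lambda), \quad u_j = \hat\lambda_j - \lambda_j, \quad a = \sum_{j=1}^{M} \|f_j\|\, |u_j|, \quad a(\lambda) = \sum_{j \in J(\lambda)} \|f_j\|\, |u_j|.$$

By Lemma 1, on $E_1$ we have

$$\|\hat f - f\|_n^2 + \sum_{j=1}^{M} \omega_{n,j} |u_j| \le \|f_\lambda - f\|_n^2 + 4 \sum_{j \in J(\lambda)} \omega_{n,j} |u_j|. \qquad (4.10)$$

Now, on the set $E_2$,

$$4 \sum_{j \in J(\lambda)} \omega_{n,j} |u_j| \le 8 r_{n,M}\, a(\lambda) \le 8 r_{n,M} \sqrt{M(\lambda)} \left( \sum_{j \in J(\lambda)} \|f_j\|^2 u_j^2 \right)^{1/2}. \qquad (4.11)$$

Here

$$\begin{aligned}
\sum_{j \in J(\lambda)} \|f_j\|^2 u_j^2 &= \|\hat f - f_\lambda\|^2 - \sum_{i,j \notin J(\lambda)} \langle f_i, f_j \rangle u_i u_j - 2 \sum_{i \notin J(\lambda)} \sum_{j \in J(\lambda)} \langle f_i, f_j \rangle u_i u_j - \sum_{i,j \in J(\lambda),\, i \ne j} \langle f_i, f_j \rangle u_i u_j \\
&\le \|\hat f - f_\lambda\|^2 + 2\rho \sum_{i \notin J(\lambda)} \|f_i\|\, |u_i| \sum_{j \in J(\lambda)} \|f_j\|\, |u_j| + \rho\, a^2(\lambda) \\
&= \|\hat f - f_\lambda\|^2 + 2\rho\, a(\lambda)\, a - \rho\, a^2(\lambda),
\end{aligned}$$

where we used the fact that $\sum_{i,j \notin J(\lambda)} \langle f_i, f_j \rangle u_i u_j \ge 0$. Combining this
with the second inequality in (4.11) yields

$$a^2(\lambda) \le M(\lambda) \left\{ \|\hat f - f_\lambda\|^2 + 2\rho\, a(\lambda)\, a - \rho\, a^2(\lambda) \right\},$$

which implies

$$a(\lambda) \le \frac{2\rho M(\lambda)\, a}{1 + \rho M(\lambda)} + \frac{\sqrt{M(\lambda)}\, \|\hat f - f_\lambda\|}{1 + \rho M(\lambda)}. \qquad (4.12)$$

From (4.10), (4.12) and the first inequality in (4.11) we get

$$\|\hat f - f\|_n^2 + \sum_{j=1}^{M} \omega_{n,j} |u_j| \le \|f_\lambda - f\|_n^2 + \frac{16 \rho M(\lambda)\, r_{n,M}\, a}{1 + \rho M(\lambda)} + \frac{8 r_{n,M} \sqrt{M(\lambda)}\, \|\hat f - f_\lambda\|}{1 + \rho M(\lambda)}.$$

Combining this with the fact that $r_{n,M} \|f_j\| \le \sqrt{2}\, \omega_{n,j}$ on $E_2$ and $\rho M(\lambda) \le 1/45$
we find

$$\|\hat f - f\|_n^2 + \frac12 \sum_{j=1}^{M} \omega_{n,j} |u_j| \le \|f_\lambda - f\|_n^2 + 8 r_{n,M} \sqrt{M(\lambda)}\, \|\hat f - f_\lambda\|.$$

Intersect with $E_3(\lambda)$ and use the fact that $\omega_{n,j} \ge c_0 r_{n,M}/\sqrt{2}$ on $E_2$ to derive
the claim. □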
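
The role of the threshold $1/45$ in the last step of this proof can be checked numerically. Using $r_{n,M} \|f_j\| \le \sqrt{2}\, \omega_{n,j}$, the term $16 \rho M(\lambda) r_{n,M} a / (1 + \rho M(\lambda))$ is at most $16\sqrt{2}\, t/(1+t)$ times $\sum_j \omega_{n,j} |u_j|$, with $t = \rho M(\lambda)$; it can be absorbed into the left hand side, leaving the factor $1/2$, as long as this coefficient does not exceed $1/2$. The function name below is a hypothetical helper introduced for illustration.

```python
import math


def absorbed_fraction(t):
    """Coefficient 16 * sqrt(2) * t / (1 + t) that multiplies
    sum_j omega_{n,j} |u_j| on the right hand side, where t = rho * M(lambda)."""
    return 16.0 * math.sqrt(2.0) * t / (1.0 + t)


if __name__ == "__main__":
    for t in (1.0 / 45.0, 1.0 / 44.0):
        print(t, absorbed_fraction(t))
```

The coefficient is increasing in $t$, stays below $1/2$ at $t = 1/45$, and already exceeds $1/2$ at $t = 1/44$, which suggests that $1/45$ is the largest threshold of the form $1/k$ with integer $k$ for which this argument goes through.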



4.3. Proof of Theorem 2.3

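Let $\lambda \in \Lambda_2$ be arbitrary, fixed and we set for brevity $C_f = 1$. We consider
separately the cases (a) $\|f_\lambda - f\|^2 \le r_{n,M}^2 M(\lambda)$ and (b) $\|f_\lambda - f\|^2 > r_{n,M}^2 M(\lambda)$.

Case (a). It follows from Theorem 2.2 that

$$\|\hat f - f\|^2 + r_{n,M} |\hat\lambda - \lambda|_1 \le C r_{n,M}^2 M(\lambda) \le C \left\{ r_{n,M}^2 M(\lambda) + \|f_\lambda - f\|^2 \right\}$$

with probability greater than $1 - \pi_{n,M}(\lambda)$.

Case (b). In this case it is sufficient to show that

$$\|\hat f - f\|^2 + r_{n,M} |\hat\lambda - \lambda|_1 \le C' \|f_\lambda - f\|^2, \qquad (4.13)$$

for a constant $C' > 0$, on some event $E'(\lambda)$ with $P\{E'(\lambda)\} \ge 1 - \pi'_{n,M}(\lambda)$. We
proceed as follows. Define the set

$$U'(\lambda) = \left\{ \mu \in \mathbb{R}^M : \|f_\mu\| \ge \|f_\lambda - f\|,\ \ |\mu|_1 \le \frac{2\sqrt{2}}{c_0 r_{n,M}} \left( 3\|f_\lambda - f\|^2 + 8 \|f_\lambda - f\| \cdot \|f_\mu\| \right) \right\}$$

and the event

$$E'(\lambda) := E_1 \cap E_2 \cap E_3(\lambda) \cap E_5(\lambda),$$

where

$$E_5(\lambda) = \left\{ \sup_{\mu \in U'(\lambda)} \frac{\|f_\mu\|^2 - \|f_\mu\|_n^2}{\|f_\mu\|^2} \le \frac12 \right\}.$$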
We prove the result by considering two cases separately: $\|\hat f - f_\lambda\| \le \|f_\lambda - f\|$
and $\|\hat f - f_\lambda\| > \|f_\lambda - f\|$.

On the event $\{\|\hat f - f_\lambda\| \le \|f_\lambda - f\|\}$ we have immediately

$$\|\hat f - f\|^2 \le 2\|\hat f - f_\lambda\|^2 + 2\|f_\lambda - f\|^2 \le 4\|f_\lambda - f\|^2. \qquad (4.14)$$

Recall that being in Case (b) means that $\|f_\lambda - f\|^2 > r_{n,M}^2 M(\lambda)$. This coupled
with (4.14) and with the inequality $\|\hat f - f_\lambda\| \le \|f_\lambda - f\|$ shows that the right
hand side of (4.9) in Lemma 8 can be bounded, up to multiplicative constants,
by $\|f_\lambda - f\|^2$. Thus, on the event $E'(\lambda) \cap \{\|\hat f - f_\lambda\| \le \|f_\lambda - f\|\}$ we have

$$r_{n,M} |\hat\lambda - \lambda|_1 \le C \|f_\lambda - f\|^2,$$

for some constant $C > 0$. Combining this with (4.14) we get (4.13), as desired.

Let now $\|\hat f - f_\lambda\| > \|f_\lambda - f\|$. Then, by Lemma 8, we get that $\hat\lambda - \lambda \in U'(\lambda)$,
on $E_1 \cap E_2 \cap E_3(\lambda)$. Using this fact and the definition of $E_5(\lambda)$, we find that on
$E'(\lambda) \cap \{\|\hat f - f_\lambda\| > \|f_\lambda - f\|\}$ we have

$$\tfrac12 \|\hat f - f_\lambda\|^2 \le \|\hat f - f_\lambda\|_n^2.$$

Repeating the argument in (4.4) with the only difference that we use now Lemma
8 instead of Lemma 2 and recalling that $\|f_\lambda - f\|^2 > r_{n,M}^2 M(\lambda)$ since we are in
Case (b), we get

$$\|\hat f - f_\lambda\|^2 \le C \left\{ r_{n,M}^2 M(\lambda) + \|f_\lambda - f\|^2 \right\} \le C'' \|f_\lambda - f\|^2 \qquad (4.15)$$

for some constants $C > 0$, $C'' > 0$. Therefore,

$$\|\hat f - f\|^2 \le 2\|\hat f - f_\lambda\|^2 + 2\|f_\lambda - f\|^2 \le (2C'' + 1) \|f_\lambda - f\|^2. \qquad (4.16)$$

Note that (4.15) and (4.16) have the same form (up to multiplicative constants)
as the condition $\|\hat f - f_\lambda\| \le \|f_\lambda - f\|$ and the inequality (4.14) respectively.
Hence, we can use the reasoning following (4.14) to conclude that on
$E'(\lambda) \cap \{\|\hat f - f_\lambda\| > \|f_\lambda - f\|\}$ inequality (4.13) holds true.

The result of the theorem follows now from the bound $P[\{E'(\lambda)\}^C] \le \pi'_{n,M}(\lambda)$,
which is a consequence of Lemmas 5, 6 and of the next Lemma 9.

□



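Lemma 9. Assume (A1) and (A2). Then, for all $n \ge 1$, $M \ge 2$,

$$P\left[ \{E_5(\lambda)\}^C \right] \le 2M^2 \exp\left( -\frac{n r_{n,M}^2}{16 L_0 C^2} \right) + 2M^2 \exp\left( -\frac{n r_{n,M}}{8 L^2 C} \right),$$

where $C = 8 \cdot 11^2 c_0^{-2}$.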

Proof. The proof closely follows that of Lemma 7. Using the inequality
$\|f_\lambda - f\|^2 \le r_{n,M}$, we deduce that

$$P\left[ \{E_5(\lambda)\}^C \right] \le P\left[ \eta_{n,M}\, \frac{8 \cdot 11^2}{c_0^2 r_{n,M}^2}\, \|f_\lambda - f\|^2 \ge \frac12 \right] \le P\left[ \eta_{n,M}\, \frac{8 \cdot 11^2}{c_0^2 r_{n,M}} \ge \frac12 \right].$$

An application of Bernstein's inequality with $\zeta_k = f_i(X_k) f_j(X_k)$, $\varepsilon = r_{n,M}/(2C)$,
$w^2 = L_0$ and $d = L^2$ completes the proof of the lemma. □